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' Abstract. We prove a Theorem justifying the regularity conditions which 

. are needed for Path Sampling in Factor Models. We then show that the 

remaining ingredient, namely MCMC for calculating the integrand at 
each point in the path, may be seriously flawed, leading to wrong es- 
CN ■ timates of Bayes Factors. We provide a new method of Path Sampling 

(with Small Change) that works much better than standard Path Sam- 
| pling in the sense of estimating the Bayes Factor better and choosing 

• the correct model more often. When the more complex Factor model 

is true, PS-SC is substantially more accurate. New MCMC diagnostics 
is provided for these problems in support of our conclusions and rec- 
ommendations. Some of our ideas for diagnostics and improvement in 
computation through small changes should apply to other methods of 
| computation of Bayes Factor for model selection. 

^ . Key words and phrases: Bayes Model Selection, Covariance Models, 

^ | Path Sampling, Laplace Approximation. 

! 1. BAYES MODEL SELECTION 

Advances in MCMC techniques to compute the posterior for many complex, 
hierarchical models have been a major reason for success in Bayes modeling and 
analysis of complex phenomena (Andrieu et. al. (2004)). These techniques along 
^ ' with applications are surveyed in numerous papers, including Chen et. al. (2000), 

■ Liu (2008) and Robert and Casella (2004). Moreover, many Bayesian books on 

applications or theory and methods provide a quick introduction to MCMC such 
as Gelman et. al. (2003), Ghosh et. al. (2006), Gamerman and Lopes (2006) and 
Lynch (2007). 

Just as the posterior for the parameters of a given model are important for 
calculating Bayes estimates, posterior variance, credibility intervals and a general 
description of the uncertainty involved, one needs to calculate Bayes Factors for 
selecting one of several models. Bayes factors are the ratio of marginals of given 
data under different models, when more than one model is involved and one wishes 
to choose one from among them, based on their relative or posterior probability. 
The ratio of marginals measures the relative posterior probability or credibility 
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of one model with respect to the other if we make the usual objective choice of 
half as prior probability for each model. 

Although there are many methods for calculating Bayes Factors, their success 
in handling complex modern models is far more limited than seems to be generally 
recognized. Part of the reason for lack of awareness of this is that model selection 
has become important relatively recently. Also one may think that, in principle, 
calculation of a BF can be reduced to the calculation of a posterior and hence 
solvable by the same methods as those for calculating the posterior. Reversible 
Jump MCMC (RJMCMC) is an innovative methodology due to Green (1995), 
based on this simple fact. However, two models essentially lead to two different 
sets of states for any Markov chain that connects them. The state spaces for 
different models often differ widely in their dimension. This may prevent good 
mixing and may show up in the known difficulties of tuning RJMCMC. For a 
discussion of tuning difficulties see Robert and Casella (2004). 

The other popular method for calculating BF is path sampling (PS), which 
is due to Gelman and Meng (1998), and recently re-examined by Lefebvre et. al. 
(2009). Our major goal is to explore PS further in the context of nested, rel- 
atively high dimensional covariance models, rather than non-nested low dimen- 
sional mean models, as in the last reference. The new examples show both similar- 
ities and sharp changes from the sort of behavior documented in Lefebvre et. al. 
(2009). 

We consider three paths, namely, the geometric mean path, the arithmetic 
mean path and the parametric arithmetic mean path, which appear in Gelman and Meng 
(1998), Lefebvre et. al. (2009), Ghosh and Dunson (2008), Ghosh and Dunson 
(2009), Lee and Song (2002) and Song and Lee (2006). Our priors are usually the 
diffuse Cauchy priors, first suggested by Jeffreys (1961) and since then recommeded 
by many others, including Berger (personal communication), Liang et. al. (2008), 
Gelman (2006) and Ghosh and Dunson (2009). But we also examine other less 
diffuse priors too, going all the way to normal priors. Since Lefebvre et. al. (2009) 
have studied applications of PS to mean like parameters, we focus on covariance 
models. We restrict ourselves generally to factor models for covariance, which 
have become quite popular in recent applications, e.g. Lopes and West (2004), 
Ghosh and Dunson (2008), Ghosh and Dunson (2009) and Lee and Song (2002). 
The recent popularity of factor models is due to the relative ease with which 
they may be used to provide a sparse representation of the covariance matrix 
of multivariate normal data in many applied problems of finance, psychometry 
and epidemiology, see for example the last three references. Also, often it leads 
to interesting scientific insight, see Bartholomew et. al. (2002). 

In addition to prior, likelihood and path, there are other choices to be made 
before PS can be implemented, namely, a method of discretizing the path, e.g., 
by equispaced points or adaptively (Lefebvre et. al. (2009)) and how to integrate 
the score function of Gelman and Meng (1998) at each point in the discrete path. 
A popular method is to use MCMC. These more technical choices are discussed 
later in the paper. Along with PS we will consider other methods like Importance 
Sampling (IS) and its descendents like Annealed Importance Sampling (AIS), due 
to Neal (2001), and Bridge Sampling (BS), due to Meng and Wong (1996). 

We now summarize our contribution in this paper. 

In Section 2 we review what is known about Path Sampling and Factor Models. 
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We introduce Factor Models, a suitable path and suitable diffuse t-priors. The 
path we use was first introduced in Gelman and Meng (1998) for mean models 
and by Lee and Song (2002) and Ghosh and Dunson (2009) for Factor Models. 

In Subsection 2.4 we prove a theorem (Theorem 2.1) which essentially shows 
that except for the convergence of MCMC estimates for expected score function 
Et(U(9,t)) at each grid point t in the path, all other needed conditions for PS 
will hold for our chosen path, prior and likelihood for Factor Models. In one of the 
Remarks following the Theorem we generalize this result to other paths. Remark 
3 points to the need for some finite moments for the prior not just for Theorem 
2.1 to hold but for the posterior to behave well. Then in Remark 5 we provide a 
detailed, heuristic argument as to why the MCMC may fail dramatically by not 
mixing properly if the data has come from the bigger of the two models under 
consideration. If our heuristics is correct, and there is a small interval where 
Et(U(9,t)) oscillates most, then a grid size that is a bit coarse will not only be a 
bit inaccurate, it will be very wrong. Even if the grid size is sufficiently small, one 
will need to do MCMC several times with different starting points just to realize 
PS will not work. Our new proposal avoids these problems but will require more 
time if many models are involved. 

In Section 3, we give an argument as to why the above is unlikely to be true 
if the data has come from the smaller model. More importantly, in Subsection 
3.3 we propose a modification of PS, which we call Path Sampling with Small 
Change (PS-SC) which is expected to do better. 

Implementation of PS and PS-SC can be very time consuming due to the need 
of MCMC sampling for each grid point along the path. Time can be saved if we can 
implement PS and PS-SC by parallel computation, as noted by Gelman and Meng 
(1998). 

In Subsection 3.4 we show MCMC output for the various cases discussed and 
validate our heuristics above. The diagnostics via projection into likelihood space 
should prove useful for other model selection problems. Our gold standard is PS- 
SC, based on an MCMC with the number of draws m=50,000 and burn in of 
1,000, if necessary. But actually in our examples m=6,000 and burn in of 1000 
suffice for PS-SC. For other model selection rules also we go up to m=50,000 if 
necessary. After Subsection 3.4, having shown our modified PS, namely PS-SC, 
is superior to PS under both models, we do not consider PS in the rest of the 
paper. 

In the last two Sections we touch on the following related topics: effects of 
grid size, alternative path, alternative methods and performance of PS-SC and 
some other methods in very high dimensional simulated and real examples. PS-SC 
seems to choose the true models in the simulated cases and relatively conservative 
models for real data. In Section 5 we explore various real life and high dimensional 
factor models, with the object of combining PS-SC with two of the methods which 
do relatively well in Section 4 to reduce the time of PS-SC in problems with the 
number of factors rather high say 20 or 26, for which PS-SC can be quite slow. 

In Appendix A.l we introduce briefly a few other methods like Annealed Im- 
portance Sampling (AIS) which we have compared with PS-SC. Finally, Appendix 
A. 2 points to some striking differences between what we observe in Factor Mod- 
els and what one might have expected from our familiar classical asymptotics for 
maximum likelihood estimates. Of course, as pointed out by Drton (2009), classi- 
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cal asymptotics does not apply here but it surprised us that the differences would 
be so stark. It is interesting and worth pointing out that the Bayes methods like 
PS-SC can be validated partly theoretically and partly numerically inspite of a 
lack of suitable asymptotic theory. 

2. PATH SAMPLING AND FACTOR MODELS 

In the following subsections, we review some basic facts about PS, including 
definition of the three paths and the notion of an optimal path. More importantly, 
since our interest would be in model selection for covariance rather than mean, 
we introduce Factor models and then PS for Factor Models in Subsection 2.3 and 
2.4. 

Subsection 2.1 is mostly an introduction to PS and reviews previous work. After 
that we show the failure of PS-estimates in a toy problem related to the modelling 
of the covarince matrix in Subsection 2.2. In Subsection 2.3 we introduce Factor 
Models and our priors. Subsection 2.4 introduces paths that we consider for Factor 
Models and a Theorem showing the regularity conditions needed for validity of 
PS under Factor Models. Then in a series of remarks we extend the theorem and 
also study and explain how the remaining ingredient of PS, namely MCMC, can 
go wrong. We show a few MCMC outputs to support our arguments in Subsection 
3.4. This particular theme is very important and will come up several times in 
later sections or subsections where related different aspects will be presented. 

2.1 Path Sampling 

Among the many different methods related to importance sampling, the most 
popular is path sampling (PS). However PS is best understood as a limit of the 
simpler bridge sampling (BS) (Gelman and Meng (1998)). So we first begin with 
BS. 

It is well-known that unless the densities of the sampling and target distri- 
butions are close, Importance Sampling (IS) will have high variance as well as 
high bias. Due to the difficulty of finding a suitable sampling distribution for IS, 
one may try to reduce the difficulty by introducing a non-normalized interme- 
diate density fxj2 that acts like a bridge between the non- normalized sampling 
density /i and non-normalized target density /o (Meng and Wong (1996)). One 
can then use the identity Z\/Zq = and estimate both the numerator and 

denominator by IS. Extending this idea, Gelman and Meng (1998) constructed 
a whole path f t :t€ [0, 1] connecting /o and f\. This is also like a bridge. Dis- 

cretizing this they get the identity Z\/Zq = n^=i ~^r^^f2T^ ' wn id 1 leads to a 
chain of IS estimates in the numerator and denominator. We call this estimate 
the Generalized Bridge Sampling (GBS) estimate. 

More importantly, Gelman and Meng (1998) introduced PS, which is a new 
scheme, using the idea of a continuous version of GBS but using the log scale. 
The PS estimate is calculated by first constructing a path as in BS. Suppose the 
path is given by pt : t £ [0, 1] where for each t, pt is a probability density. Then 
we have the following identity. 

(2.1) Pt(0) = -ft(0), 
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where ft is an unnormalized density and zt = J ft(6)d0 is the normalizing con- 
stant. Taking the derivative of the logarithm on both sides, we obtain the follow- 
ing identity under the assumption of interchangeability of the order of integration 
and differentiation: 

(2.2) j t log{z t ) = J L±f t (0)^dB) = E t [| log f t {9)\ = E t [U(9,t)} 

where the expectation E t is taken with respect to pt{9) and U (9, t) = A log ft{0). 
Now integrating (2.2) from to 1 gives the log of the ratio of the normalizing 
constants, i.e. log BF in the context of model selection: 

(2.3) log&= f 1 E t [U(0,t)]dt 

A) •'0 

To approximate the integral we discretize the path with k points i(o) = < tm < 
. . . < t/fy = 1 and draw m MCMC samples converging to pt(0) at each of these 
k points. Then estimate E t [U{9,t)} by ^Y.U{9 {i \t) where 9® is the MCMC 
output. To estimate the final log Bayes factor, commonly numerical integration 
schemes are used. It is clear that MCMC at different points "t" on the path can 
be done in parallel. We have used this both for PS and for our modification of it, 
namely PS-SC introduced in Subsection 3.3. 

Gelman and Meng (1998) showed there is an optimum path in the whole dis- 
tribution space providing a lower bound for MCMC variance, namely 

where fo and f\ are the densities corresponding the two models compared and 
H(fo, fi) is their Hellinger distance. Unfortunately in nested examples /o and f\ 
are mutually orthogonal, so H(f$, f\) takes the trivial value of two. Moreover m 
is so large that the lower bound becomes trivial and unattainable. However in a 
given problem, one path may be more suitable or convenient than another. 

Following Gelman and Meng (1998) and Lefebvre et. al. (2009), we consider 
three paths generally used for the implementation of PS. Geometric Mean Path 
(GMP) and Arithmetic Mean Path (AMP) are defined by the mean (f t = / (1_t) /i 
and ft = tfo + (1 — i)/i respectively) of the densities of two competing models 
for each model Mt : t S (0, 1) along the path. Our notation for Bayes Factor is 
given later in Equation 2.6. 

One more common path is obtained by assuming a specific functional form fg 
for the density and then constructing the path in the parametric space (9 G O) of 
the assumed density. If 9t = t9o+(l—t)9i, then f t fi t is the density of the model M^, 
where fo^ = fo and fi g 1 = f\. We denote this third path as Parametric Arith- 
metic Mean Path (PAMP). The PAMP path was shown by Gelman and Meng 
(1998) to minimize the Rao distance in a path for model selection about normal 
means. More importantly it is very convenient for use of MCMC, as shown for 
some Factor models by Song and Lee (2006) and Ghosh and Dunson (2009), and 
for linear models by Lefebvre et. al. (2009). Implementation of PS with the paths 
mentioned above are denoted as GMP-PS, AMP-PS and PAMP-PS. In view of 
the discussion in Lefebvre et. al. (2009) regarding the degeneracy of the AMP-PS, 
we will only consider PAMP-PS and GMP-PS. 
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Unlike Lefebvre et. al. (2009), who study models about means, our interest 
is in studying model selection for covariance models, specifically Factor Models 
with different number of factors. These are discussed in the Subsections 2.3 and 
2.4. Performance of PS for covariance models can be very different from the 
examples in Lefebvre et. al. (2009). In the next subsection we give a toy example 
of covariance model selection where PS fails and our proposed modification PS-SC 
is also not applicable. 

2.2 Covariance Model : Toy Example 

To illustrate the difficulties in calculation of the BF that we discuss later, we 
begin by considering a problem where we can calculate the true value of the Bayes 
Factor. 

We assume Y p ~ iV(0, £), for some m < p we wish to test whether Y± ... m and 
Y m+ \ „ are independent or not. If £ = [ .} l ] where Y\ m ~ N(0,Au) 

' ' \ A 12 A 2 2 ) 

and l^n+i v .. iP ~ ^(0,^22)) then the competetive models for a fixed m will be, 
Mq : Ayi = vs Mi : Ayi 7^ 0. Under Mi we use a inverse- Wishart prior for the 
covariance matrix as it helps us to calculate the true BF, using the conjugacy 
property of the prior. Under Mq we take An, A22 to be independent, each with 
a inverse Wishart prior. 

We illustrate the above problem with p=10 and m=7 for a positive definite 
f A A \ 

matrix S° = I . .q 1 ., .J 2 (given in Appendix A. 3). We implement the Path 
\\A12) ^22/ 

Sampling for this problem connecting Mq and Mi, using a Parametric Arithmetic 
Mean Path : 



(2.4) M l : tt ~ivfo.E=L^ ) , ^ 



12 
22 , 



For every < t < 1, the £ matrix is positive definite, being a convex combination 
of two positive definite matrices. For t=0 and t=l we get the models Mq and Mi. 

We can estimate the Bayes Factor by using the Path Sampling schemes as 
described earlier. We simulated two datasets one each from Mq and Mi and 
report the true BF value with the PS estimate in Table 1. Here the reported 
Bayes Factor is defined as the ratio 2£L, where mi and niQ are the marginals 
under the models Mi and Mq respectively. 



Method 


Data 1 


Data 2 


True BF value 
PS estimate of BF 


96.19 
73.91 (.012) 


-124 
-4.28 (.008) 



Table 1 

Performance of PS in Toy Example modelling Covariance 



The values in table show us that estimated BF value is off by an order of 
magnitude when Mq is true. The value is relatively stable as judged by the MCMC 
variance based on ten runs and near to the true value for Mi . We note in passing 
that our new proposal PS-SC which is introduced in Subsection 2.4 cannot be 
implemented. Also the explanation for failure of standard PS in Factor Models, 
as explained later in the paper, is not applicable to this toy example. 
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2.3 Factor Models and Bayesian Specification of Prior 

A factor model, with k factors is defined as n i.i.d. observed r.v.'s 

yi = Atji + ei, e» ~ N p (0, E), 

where A is a p x k matrix of factor loadings, 

Vi = (ViU ■ ■ • > Vik)' ~ A fc (0, 4) 

is a vector of standard normal latent factors, and is the residual with diagonal 
covariance matrix £ = diag{p\, . . . ,az). Thus we may write the marginal distri- 
bution of yi as Np(Q,£l), 0, = AA' + £. This model implies that the sharing of 
common latent factors explains the dependence in the outcomes and the outcome 
variables are uncorrelated given the latent factors. 

A factor model, without any other constraints, is non-identifiable under orthog- 
onal rotation. Post-multiplying A by an orthogonal matrix P, where P is such that 
PP' = Ifc, we obtain exactly the same O as in the previous factor model. To avoid 
this, it is customary to assume that A has a full-rank lower triangular structure, 
restricting the number of free parameters in A and E to q = p(k + l) — k(k — l)/2, 
where k must be chosen so that q < p(p + 1)/2. The reciprocal of diagonal entries 
of £ form the precision vector here. 

It is well-known that maximum likelihood estimates for parameters in fac- 
tor models may lie on boundaries and hence likelihood equations may not hold 
(Anderson (1984)). The Bayes estimate of £1 defined as average over MCMC 
outputs is well-defined, easy to calculate and, being average of positive definite 
matrices, is easily seen to be positive definite. This fact is used to search for max- 
imum likelihood estimate (mle) or maximum prior x likelihood estimates (mple) 
in a neighborhood of the Bayes estimate. 

We also note for later use the following well-known simple fact, e.g. Anderson 
(1984). If the likelihood is maximized over all positive definite matrices 0, not 
just over factor models, then the global maximum for n independent observations 
exists and is given by 

1 " 

(2.5) n = — rj^(yi-y)(yi-y)'- 

From the Bayes model selection perspective, a specification of the prior dis- 
tribution for the free elements of A and E is needed. Truncated normal priors 
for the diagonal elements of A, normal priors for the lower triangular elements, 
and inverse-gamma priors for erf , . . . , crH, have been commonly used in practice 
due to conjugacy and the resulting simplification in posterior distribution. Prior 
elicitation is not common. 

Ghosh and Dunson (2009) addressed the above problems by using the idea of 
Gelman (2006) to introduce a new class of default priors for the factor loadings 
that have good mixing properties. They used Gibbs sampling scheme and showed 
there was good mixing and convergence. They use parameter expansion to induce 
a class of t or folded-t priors depending on sign constraints on the loadings. 
Suitable t-priors have been very popular. We use the same family of priors but 
consider a whole range of many degrees of freedom going all the way to the normal 
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and use the same Gibbs sampler as in Ghosh and Dunson (2008). We have used 
a modified version of their code. 

In the factor model framework, we stick to the convention of denoting the 
Bayes factor for two models with latent factors h — 1 and h as 

(2.6) BF Kh . 



m h -i(x) 

where rrih(x) is the marginal under the model having h latent factors. So the 
Bayes Factor for simpler model ( defined as Mq ) and complex model ( defined as 
Mi) with h — 1 and h latent factors will be defined as BF^h-i- We choose the 
model with h and h — 1 latent factors respectively, depending on the value of log 
Bayes factor being positive and negative. Alternatively one may choose a model 
only when the value of log BF is decisively negative or positive; say less than or 
greater than a chosen threshold. 

2.4 Path Sampling for Factor Models 

There are several variants of Path Sampling, which have been explored in differ- 
ent setups, depending on choice of path, prior and other tuning parameters (grid 
size and MCMC sample size). In the Factor model setup the parametric arith- 
metic mean path (PAMP) (used by Song and Lee (2006) and Ghosh and Dunson 
(2009)) seems to be the most popular one. We also consider Geometric Mean 
Path (GMP) along with the PAMP for Factor Model. 

By constructing a GM path from corresponding prior to the posterior, we 
can estimate the value of the log-marginal under both Mq and Mi, which in turn 
leads us to an estimate of the log-BF. The Gibbs Sampling scheme for these paths 
with parameter-expansion (for different priors) is described in the Appendix. We 
will first describe the two paths and their corresponding score functions to be 
estimated along the path. 

i. Parametric Arithmetic Mean Path : Lee and Song (2002) used this 
path in factor models, following an example in Gelman and Meng (1998). 
Ghosh and Dunson (2008) also used this path along with parameter expan- 
sion. Here we define Mq and M\ to be the two models corresponding to 
the factor model with factors h — 1 and h, respectively and then connect 
them by the path M t : m = A t rji + e i} A t = (Ai, A 2 , . . . , X h -i, t\ h ) where Xj 
is the j-th column of the loading matrix. So for t=0 and t=l we get the 
models Mq and M\. The likelihood function at grid point t is a MVN which 
is denoted as f(Y\A, E, 77, t). We have independent priors vr(A), 7r(E), ^(77) 
and a score function, 

n 

(2.7) U(A,E,r,,Y,t) = - A^/S" 1 ^^- 1 ), A,) ??4 . 

i=i 

For fixed and ordered grid points along the path t( ) = < tm < ■ ■ ■ < 
*(5) < = 1) our P & th sampling estimate for the log Bayes factor is 

_ 1 s 

(2.8) log(BF h:h ^) = - £>-+i - t s )(E s+1 (U) + E S {U)). 

1 s=0 
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We simulate m samples of (At S) j, Ej, r/j : i = 1, . . . , m) from the posterior 
distribution of (At s , E, ?y) at the point < £ s < 1 and use them to estimate 
£.(E0 = £ E l/(A t „i, S i5 y), Vs = 1, . . . , S + 1. 
ii. Geometric Mean Path : This path is constructed over the distribu- 
tional space (Gelman and Meng (1998)), hence we model the density for the 
model M t at each point along the grid. We use the density /t(A, T,,rj\Y) = 
f(y\A, E, T)) w(A, E, rj) as the unnormalized density for the model Mt con- 
necting the prior and the posterior, when 7r(A, E,r?) and /(y|A, E,r/) are 
prior and the likelihood function correspondingly. By using PS along this 
path we can find the log marginal for the models Mq and Mi , as the normal- 
izing constant for the prior is known. Hence the log BF can be estimated 
by using those estimates of the log marginal for those models. The score 
function U (A, E, rj, Y, t) will be the loglikelihood function log f(y\A, E, 77). 

The theorem below verifies the regularity conditions of Path Sampling for 
Factor Models. For PS to succeed we also need convergence of MCMC at each 
point in the path. That will be taken up after proving the theorem. 

Theorem 2.1. Consider Path Sampling for factor models with parametric 
arithmetic mean path (PAMP), likelihood as given above for factor models and a 
proper prior for which the score function is integrable w.r.t. the prior, 

1. The interchangeability of integration and differentiation in (2.2) is valid. 

2. E t (U) is finite as t — > 0. 

3. The path sampling integral for factor models, namely (2.3), is finite. 

Proof. 1. Here for notational convenience, we write (A, E,??) = 9. When 
f(Y\8) and ir(9) are the likelihood function of the data and the prior density 
function for the corresponding parameter respectively, then the following is 
equivalent to showing eqn (2.2). 

si C OO C CO f-} 

- / f(Y\0,t)ir(0)dO = / -f(Y\d,t)ir(d)dd 

Qjt J— 00 J — 00 u£ 

We can write the LHS as following, 



,5^0 7-oo 6 

/oo 
f(Y\e,t')TT(e)de ,t' e [t,t + s] 
-00 



= lim / U(Y\0,t')f{Y\e,t')TT(0)dO 

<5-^0 ,/_oo 

where t' G [t,t + 5]. U is a quadratic function in 9, and hence its absolute 
value is bounded above by a quadratic function in 9, free of t but depending 
on Y. f(Y\6,t') is bounded by the global maximum of the MVN likelihood, 
say M, achieved at ft (eqn 2.6). Now applying the moment assumptions 
for Tt(6) we can use Dominated Convergence Theorem (DCT) and take the 
limit within the integral sign. 
2. We can write Et(U) as follows: 

= fU(Y9,t)f(Y\9,t)7r(9)d9 
1 I f(Y\9,t)n(9)d9 
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Thus 



lim£ t ([/) 

t — s-0 



lim t ^ fU(e,Y,t)f(Y\8,t)ir(8)d8 
[ j " lim^o C t 

where, C t = / f{Y\O,t)ir(O)d0. To prove 2, we apply the DCT to both the 
numerator and denominator as t — > exactly as above, that is using an 
upper bound of \U\ which is integrable w.r.t. ir and free of t and using 
the boundedness of f(Y\6,t). An exactly similar argument applies to the 
denominator and its limit can be identified as the marginal of Y under M$ 
and hence obviously positive. 
3. We prove Et(U(9,t)) is a continuous function of t for < t < 1. 
Note when t o e[0, 1] 

lim EAU) 

(2 10) = Hm t ^ to fu(e,Y,t)f(Y\e,t)<K(e)de ^ 

Now the proof holds exactly as for t = 0. 

□ 



In Remark 1, we extend the Theorem to other paths. Then in a series of 
Remarks we study various aspects like convergence and divergence of PS, that 
are closely related to the Theorem. All the remarks are related to the Theorem 
and insights gained from its proof. Remark 5 is the most important. 

Remark 1 For PS with GMP, the score function is the loglikelihood func- 
tion which can be bounded as before by using the RHS of equation (2.5). Also, 
/(y|A,S,ry)' < (1 V f(y\(l)) with Cl as in equation (2.5). We believe a similar 
generalization holds for most paths modeling means of two models. Now the 
proof of Theorem (2.1) applies exactly as before (i.e. as for PAMP). We exhibit 
performance of PS for this path in Section 4. 

Remark 2 If we further assume the MCMC average at each point on the 
grid converges to the Expectation of score function of MCMC then the Theorem 
implies the convergence of PS. We showed the integrand is continuous on [0, 
1]. So by continuity it can be approximated by a finite sum. Now take limit of 
MCMC average at each of these finitely many grid points. However, even if the 
MCMC converges in theory, the rate of convergence may be very slow or there 
may be a problem with mixing even for m=50,000, which we have taken as our 
gold standard for good MCMC. This problem will be apparent to some extent 
from high MCMC variance. 

Remark 3 As t — > the likelihood is practically independent of the extra 
parameters of the bigger model, so that a prior for those parameters (conditional 
on other parameters) will not learn much from data. In particular, the posterior 
for these parameters will remain close to the diffuse prior one normally starts 
with. If the prior fails to have the required finite moment in the theorem, the 
posterior will also be likely to have large values for moments, which may cause 
convergence problems for the MCMC. That's why we chose a prior making the 
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score function integrable. In the proof of the Thoerem, we have assumed the first 
two moments of the prior to be finite. In most numerical work our prior is a t 
with 5 to 10 d.f. 

Remark 4 In the same vein we suggest that even when the integral at t near 
zero converges, the convergence may be slow for the following reason. Consider a 
fixed (At, S, 77) with a large posterior or negative value of U (A t , £, rj)L(A t , S, 77) at 
point t, the same large value will occur at (jAt, S, rj) with prior weight 7r( |A t , S, 77). 
For priors like t-distribution with low degrees of freedom, At, E, 77) will not 
decay rapidly enough to substantially diminish the contribution of the large value 
of C/(At, S, r,)L(A t , S, 77) at (At, S, 77). 

Remark 5 The structure of the likelihood and prior actually provides insight 
as to when the MCMC will not converge to the right distribution owing to bad 
mixing. To this end we sketch a heuristic argument below, which will be supported 
in Subsection 3.4 by MCMC figures. 

1. The maximized likelihood remains the same along the whole path, because 
the path makes an one to one transformation of the parameter space as 
given below. 

2. If the MLE of A^ at t=l is A/,, then MLE at t = t' is (subject to variation 
due to MCMC at two different points at the path) , which goes to infinity as 
t goes to zero. This happens as the A^ remains the vector among A^ (where 
A^ is the MCMC sample from model Mf at t) having the highest maximum 
likelihood. Hence as t — > 0, ir(Xh/t) — > at a rate determined by the tail 
of the prior. The conflict between prior and maximized likelihood may also 
be viewed as a conflict between the nested models, with the prior favoring 
the parsimonious smaller model. This inherent conflict in model selection 
seems to have the following implications for MCMC. 

We expect to see a range (say [^1^2]) near zero showing a conflict between 
prior and maximized likelihood. Definitely the points t\ and £2 are not well- 
specified, but we treat them as such so as to understand some underlying issues 
of mixing and convergence here. On the set of points t > ti the MCMC samples 
are expected to be around the points maximizing likelihood, whereas for t < t\ 
they will be nearly zero due to the concentration around a value A^ which is both 
prior mode and the mle under Mq, namely A^ = 0. But for any point in the range 
[ti,t2]j they will span a huge part of the parameter space, ranging from points 
maximizing likelihood to ones having higher prior probability, showing a lot of 
fluctuations from MCMC to MCMC. The MCMC outputs in Subsection 3.4 show 
both clusters but having highly fluctuating values (Figure 4, Subsection 3.4) for 
the proportions of the clusters. 

Equation (2.7) tells us that the score function is proportional to -f (where 
\' h is the MCMC sample from model Mt at t). Hence we will see Et(U) as an 
increasing function while t — > ti from the right hand side ((2) in Remark 5). This 
leads to a lot of variation of the estimate of Et(U) for different MCMC samples 
in the range [ii,t2] as explained above. Also, as explained above, for t < t\, the 
score function will concentrate near zero. 

The width of the zone of conflict (here ti — 1\) will shrink, if we have a relatively 
strong decaying tail of the prior. On the other hand for heavy-tailed priors we may 
see these above mentioned fluctuations for a longer range, causing a loss of mass 
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from the final integration. These problems are aggravated by the high-dimension 
of the problem and the diffuse spread of the prior on the high-dimensional space. 
This may mean the usual BF estimated by PS will be off by an order of magnitude. 
We will see the implications reflected in some figures and tables in the next 
Section, when we study PAMP-PS for Factor Models in detail in Section 3. 

Remark 6 We have checked that adaptive choice of grid points by Lefebvre et. al. 
(2009), which improves accuracy in their two examples with GMP, doesn't help 
in the case of the very large fluctuations described above. It seems to us that 
adaptive choice would work better when the two models tested are less dissimilar 
than the models in Remark 5, e.g., when the smaller of two nested models is true 
(Subsection 3.1) or when our proposed modification of PS is used (Subsection 
3.3). However, we have not verified this because even without adaptive choice, 
our new proposal worked quite well in our examples. 

We note in passing that in both the examples of Lefebvre et. al. (2009), the 
two models being tested have maximum likelihoods that differ by fifteen in the 
log scale, whereas for the models in Remark 5 they differ by much more, over 
hundred. 

3. WHAT DO ACTUAL COMPUTATIONS TELL US? 

Following the discussion in previous Section, we would like to study the effects 
of the theoretical observations in the previous Section for the implementation 
of Path Sampling. Here we only consider the PAMP for PS, and for notational 
convenience we will mention it as just PS. After studying estimated BF's in 
several simulated data sets (not reported here) from various factor models, we 
note a few salient features. Error in estimation of the BF or the discrepancy 
between different methods tends to be relatively large, if one of the following is 
true : the data has come from the complex model rather than the simpler model, 
the prior is relatively diffuse or the value of the precision parameters are relatively 
small. Different subsections study what happens if the complex or simpler model 
is true, the effect of the prior, the grid size and MCMC size. These are done in 
Subsections 3.1-3.2. 

In Subsection 3.3, we introduce a new PS scheme, which operates through 
a chain of paths, each path involving two nested models with a small change 
between the contiguous pairs. The new scheme is denoted as Path sampling with 
Small Changes (PS-SC). The effect of precision parameters will also be studied 
in this subsection for PS-SC. Then we study the MCMC samples and try to 
understand their behavior from the point of view of explaining the discrepancy 
between different methods for estimating Bayes Factors and why PS-SC does 
better than PS in Subsection 3.4. 

Our simulated data are similar to those of Ghosh and Dunson (2009) but have 
different parameters. We use a 2-factor model and 1-factor model as our complex 
model M\ and simpler model Mq correspondingly to demonstrate the underlying 
issues. The loading parameters and the diagonal entries of the £ matrix are 
given in Table 2 & 3. In simulation we take model Mq or Mi as true but £ is not 
changed. Of course if the one factor model Afo is true, then since it is nested in 
M\, Mi is also true. 
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Factor 1 


.89 





.25 





.8 





.5 


Factor 2 





.9 


.25 


.4 





.5 






Table 2 

Loading Factors used for simulation 



.2079 


.19 


.15 


.2 


.36 


.1875 


.1875 



Table 3 



Diagonal Entries of E 

3.1 Issues in Complex (2-factor) model 

We will study the effect of grid size, prior and the behavior of MCMC, keeping 
in mind Theorem 2.1 and the Remarks in Section 2. For Path Sampling with 
PAM path, we now discuss the effect of prior and the two tuning parameters, 
namely, the effect of the grid size and MCMC size, on the estimated value of 
the BF and their variance. Following the discussion in Remarks 3 8z 4, we know 
that ]imt-toEt(U) is finite and path sampling converges under some finite mo- 
ment assumption for the prior. The prior considered in PS by Ghosh and Dunson 
(2008) are Cauchy and half-Cauchy which do not have any finite moments and 
so U is not integrable. We therefore choose a relatively diffuse prior, but with 
enough finite moments for U. For finite mean and variance one needs a t with 
at least four degrees of freedom. Our favorites are t-distributions with 5 to 10 
degrees of freedom. We show results for 5 and 10 d.f. only. But we first explore 
the sensitivity of the estimate to changes in d.f. of the t-distribution as prior, over 
a range of 1 through 90. The BF values change considerably until we reach about 
40 d.f. and then it stabilizes. In Table 4 we report the log BF values estimated 
for 5 datasets simulated from 2-factor model using different priors. The behavior 
of the estimated log BF with the change of d.f. continuously from 1 to 100 is 
shown in the figure 1 for 3rd dataset. 





PS using gric 


size .01 




tl 


is 


tio 


^90 


normal 


2.62 


14.42 


22.45 


70.20 


70.25 


3.67 


11.90 


21.39 


68.70 


68.72 


3.00 


13.43 


21.31 


47.06 


47.21 


4.29 


13.17 


18.49 


48.03 


48.13 


4.20 


13.11 


18.48 


47.70 


47.74 



Table 4 

PAM-PS: Dependance of logBF2i over prior, 2-factor model true. 

We can see the estimate of the BF changing with the change in the pattern 
of the tail of the prior. The effect of the grid size and MCMC size on MCMC- 
variance of the estimate are studied, using priors t\Q and N(0,1) and reported in 
Table 5. We report mean of the estimates found from 10 different MCMC runs 
and the corresponding standard deviation. The study has been done on the 2nd 
of the 5 datasets simulated from Model 1 earlier. 

As expected Table 5 shows a major increase of MCMC-size and finer grid-size 
reduces the MCMC-variance of the estimator. The difference between the mean 
values of BF estimated by t\$ and N(0,1) differ by an order of magnitude. We 
will study these issues as well as special patterns exhibiting MCMC in Subsection 
3.4. Though the different variants of PS compared here differ in their estimated 
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50 




O 20 40 SO 80 1 OO 

Degrees of Freedom 



Fig 1. Dependance of logBF-2\ over prior for 3rd Data set. 



Grid Size 


.01 


.001 


MCMC-size Prior 


Data 2 


Data 2 


5000 tio 
N(0,1) 


21.26 (1.39) 
66.89 (4.16) 


21.2625 (1.29) 
67.2158 (3.18) 


50000 iio 
N(0,1) 


23.7189 (1.8) 
68.1035 (3.72) 


23.5783 (.52) 
68.23 (3.66) 



Table 5 



PAM-PS: Dependence of log BF21 estimates over MCMC size, 2-f actor model true 

value of BF, they still choose the correct model 100% of the time. 

3.2 Issues in Simpler (1-factor) model 

Now we study the scenario when the 1-factor model is true focusing on the 
effect of prior, grid-size and MCMC-size on the estimated Bayes Factor (Tabic 
6). In this scenario the estimates don't change much with the change of prior, 
so we will report the estimates for prior tio an d N(0,1) with different values of 
MCMC-size and grid-size. 



Grid Size 


.01 .001 


MCMC-size Prior 


Data 1 Data 1 


5000 tio 
N(0,1) 


-4.26 (.054) -4.2702 (.044) 
-4.62 (.052) -4.6045 (.051) 


50000 tio 
N(0,1) 


-4.2457 (.012) -4.246 (.007) 
-4.6064 (.0065) -4.6267 (.005) 



Table 6 



PAM-PS: Dependence of log BF21 estimates over MCMC size, while 1-factor model true 

This table shows us that the MCMC-variance improves with the finer grid- 
size and large MCMC-size as expected, but the estimated values of BF21 remain 
mostly same. As noted earlier, PS chooses the correct model 100% of the time 
when Mo is true. 

We explain tentatively why the calculation of BF is relatively stable when the 
lower dim model Mq is true. Since Mq is nested in Mi, Mi is also true in this 
case, which in turn implies both max likelihoods (under Mo &: Mi) are similar 
and smaller than for data coming from Mi true (but not Mo). This tends to 
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reduce or at least is associated with the reduction of the conflict between the two 
models or prior and likelihood along the path mentioned in the Remark 5. 

Moreover, the score function for small t causes less problem since for data 
under Mq, X' 2 is relatively small compared with that for data generated under 
Mi. 

So we see when two models are close in some sense, we expect their likelihood 
ratio will not fluctuate widely provided the parameters from the two parameter 
spaces are properly aligned, for example, if found by minimizing a K-L divergence 
between the corresponding densities or taking a simple projection from the bigger 
space to the smaller spce. This is likely to make Importance Sampling more stable 
than if the two models were very different. It seems plausible that this stability 
or its lack in the calculation of BF will also show up in methods like PS that 
are derived from Importance Sampling in some way. Ingenious modifications of 
Importance Sampling seems to mitigate but not completely solve the problem. 
Following this idea of closer models in some sense, we modify PS in a similar 
manner below. 

3.3 Path Sampling with Small Changes : Proposed Solution 

In Remark 5, Subsection 3.1, a prior-likelihood conflict was identified as a cause 
of poor mixing. This will be re-examined in the next subsection. In the present 
subsection we propose a modification of PS which tries to solve or at least reduce 
the magnitude of this problem. 

To solve this problem without having to give up our diffuse prior, we try to re- 
duce the problem to a series of one-dimensional problems so that the competeting 
models are close to each other. We calculate the Bayes Factor by using the path- 
sampling step for every single parameter that may be zero, keeping others fixed. 
It is easily seen that the original log Bayes Factor is the sum of all the log Bayes 
Factors estimated in these smaller steps. We denote this procedure as PS-SC 
(Path Sampling with Small Change) and implement with parametric arithmetic 
mean path (PAMP). More formally, if we consider A2 as a p-dimensional vector, 
then Mq and M\ differ only in the last p — 1 parameters, as A21 is always zero 
due to upper-triangular condition. We consider p models M[ : i = 1, . . . ,p, where 
for model M[ we have first i parameters of A2 being zero correspondingly. If we 
define BF( i+1 = ^~^7~^ > when rrii(x) is the marginal for the model M- then, 

p-i 

logBF 21 = J2^gBFl l+1 . 

i=i 

So we perform p — 1 pathsampling computations to estimate log BF( i+1 , Mi = 
1, . . . ,p — 1. And for each of the steps the score function will be of the following 
form, 

n 

Ul(A,V,ri,Y,t) = £ ( Vj - A^/E- 1 ^*^- 1 ), [Oi; Afc.i+ijOp-i-i])^, 
i=i 

where A t = (Ai, [0;; iA 2 ,j+i; A 2) (i+2,..., P )])- 

As in the case of small model true, the max likelihood under both models are 
close, and generally the two models are close, suggesting fluctuations are less 
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likely and true BF isn't very large. This seems generally to lead to stability of 
computation of BF. 

Also the parameter X' 2 is now one dimensional. So the score function is more 
likely to be small than when \' 2 is a vector as under PS. We also notice that in 

A' A' 

each step the score function is not anymore proportional to -j- but rather to -f- 
which will be much smaller in value, hence reducing the fluctuation and loss of 
mass. 

Computational implementation shows it to be stable for different MCMC-size 
and grid size regarding MCMC-variance and also produces smooth curve of Et{U) 
for every single step. Here we use MCMC-size of 5000/50000 and grid size of .01 
for our study and report the corresponding estimated BF values for two datasets 
from 1-factor and 2-factor model respectively. The MCMC-standard deviation of 
the estimates along with the mean of the estimated value over 10 MCMC runs 
are reported in Table 7. PS-SC has smaller variance than PS under both Mq and 
Mi. In Section 2 and Subsection 3.4, we argue that, at least under Mi, PS-SC 
provides a better estimate of BF. 



True Model 


MCMC Size 


PS-SC PS (tio) PS (N(0,1)) 


1-factor 


5000 


-8.09 (.013) -4.26 (.054) -4.62 (.052) 


1-factor 


50000 


-8.0892 (.0067) -4.24 (.012) -4.60 (.0065) 


2-factor 


5000 


80.14 (.67) 21.26 (1.39) 66.89 (3.8) 


2-factor 


50000 


80.75 (.83) 23.7189 (1.8) 68.10 (3.88) 



Table 7 

log_BF2i (variance over MCMC sample) estimated by PAM-PS-SC 



Now we see the effect of changing the precision parameters keeping the factor 
loadings as before. The diagonal entries of £ are in Table 8. The precision of these 
3 models lie in the ranges of [2.77, 6.55], [1.79, 2.44], [1.36, 1.66] correspondingly. 



Model 1 


.2079 


.19 


.15 


.2 


.36 


.1875 


.1875 


Model 2 


.553 


.52 


.48 


.54 


.409 


.55 


.54 


Model 3 


.73 


.71 


.67 


.7 


.599 


.67 


.72 



Table 8 



Diagonal Entries of E in the 3 different models: the first one is modified from Ghosh and 

Dunson (2008) 

We study PS-SC for 6 datasets generated from the 3 models (2 datasets with 
n=100 from each model: Data 1 from 1-factor and Data 2 from 2-factor model) 
and report the estimated Bayes Factor value in Table 9. 



Model 


True Model 


Data 


PS-SC 


PS (t M ) 


Model 1 


1-factor 


Data 1 


-8.09 (.012) 


-3.84 (.055) 




2-factor 


Data 2 


71.59 (.66) 


19.81 (1.38) 


Model 2 


1-factor 


Data 1 


-11.01 (.0066) 


-3.09 (.0277) 




2-factor 


Data 2 


51.41 (.3658) 


2.8 (1.9104) 


Model 3 


1-factor 


Data 1 


-5.13 (.0153) 


-2.6 (.0419) 




2-factor 


Data 2 


3.975 (.0130) 


2.2 (.3588) 



Table 9 

log BF21 estimation by PS-SC : Effect of Precision Parameter 



The effect of precision parameters are seen on the estimated value of the Bayes 
Factor (BF), more prominently when the 2-factor model is true. Generally the 
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absolute value of the BF decreases with the decrease in the value of the precision 
parameters. For the smaller value of precision parameters, we expect the model 
selection to be less conclusive, explaining the pattern shown in the estimated BF 
values. 

Under Mi, PS is often bad in estimating the Bayes Factor {BF21) but since the 
true Bayes Factor is large, it usually chooses the true model as often as PS-SC. 
When Mq is true, PS is much better in estimating the Bayes Factor but since 
the Bayes Factor is usually not that large, it doesn't choose Mq all the time. 
The probability of choosing Mq correctly depends on the data in addition to the 
true values of the parameters. PS-SC does better than PS in all these cases, it 
estimates BF21 better and chooses the correct model equally or more often. The 
sense in which PS-SC estimates BF21 better has been discussed in detail earlier 
in this Section. Under Mq PS-SC estimates BF21 better by having a smaller, i.e., 
more negative value than PS. 

3.4 Issues regarding MCMC Sampling 

^00 

soo 

500 

400 

— 300 
I I I 

200 
1 00 
o 

1 00 




Fig 2. E t (U) for prior tw and N(0, 1), 2-factor model is true. 




Fig 3. Loglikelihood for prior tio and N(0, 1), 2-factor model is true. 

This subsection is best read along with the Remarks in Section 2. We first 
study the graph of E t (U) and the likelhood values for the MCMC samples at t 
for both the t\o and N(0,1) prior (figure 2 & 3). We will plot the likelihood as a 
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scalar proxy since we can not show fluctuations of the vector of factor loadings in 
the MCMC output. The clusters of the latter can be inferred from the clusters of 
the former. We will argue that there are two clusters at each grid point and the 
mixing proportion of the two clusters has a definite pattern. 

800 i , , , 




t 



Fig 4. Et(U) and Loglikelihood for prior tio in the range t G [0, .2], 2-factor model is true. 

Under the true 2-factor model M±, denote A' = [A'i, A' 2 ] where A^ is the loading 
for the corresponding latent factor under Mt- Here \' 2 is a 7x 1 vector and becomes 
zero, as it approaches Mq from M\ (as t —¥ 0). The posterior distribution at each 
Alt can be viewed roughly as a sort of mixture model with two components 
representing Mq and Mi, the form of the likelihood as given in Theorem 1. In 
the diagram (Fig 4) of log-likelihood of MCMC samples, we see two clear clusters 
around log-likelihood values -850 and -925, representing MCMC outputs with 
nonzero \ 2 and zero X' 2 values respectively. We may think of them as coming from 
the component corresponding to M\ (cluster 2) and the component corresponding 
to Mq (cluster 1). Samples of both clusters are present in the range [.03, .2], 
while samples appear to be predominantly from cluster 2 until t=.l. A good 
representation of samples from cluster 1 are only present in the range [0,.l]. In 
the range [.03,. 2], both clusters occur with proportions varying a lot. Moreover 

here the magnitude of the score function is proportional to We see these 
fluctuations in Fig (4) in the region [.03, .2]. This is also brought out by the 
MCMC standard deviation of Et{U) which are of order of 30-50 in log scale. 

We notice the absence of any samples from Mi for t < .03, except some chaotic 
representation for a few random values of t (notice in the figure, a spike repre- 
senting samples from Mi at £=0.016), clearly representing poor mixing of MCMC 
samples near the model Mq. 

The new method PS-SC stabilizes the estimated Bayes Factor value with a 
very small MCMC-variance. Here we check through figures 5 & 6 that it avoids 
prior-likelihood conflict and the problem about mixing for MCMC samples seen 
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0.2 0.4 0.6 0.8 1 1.2 1.4 0.2 0.4 0.6 0.8 1 1.2 1.4 



Fig 5. Histograms for \' 2 2 for different values oft near t=0 (MCMC size used 50,000), using 
PS. 




0.4 0.6 0.8 1 1.2 1.4 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 



Fig 6. Histograms for X' 2 2 for different values oft near t=0 (MCMC size used 50,000), using 
PS-SC. 

for the standard PS. We concentrate our study for the first step of PS-SC. In 
this step only the first component of A' 2 , A' 22 converges to zero as t — > 0. So 
here we consider the spread of MCMC sample of A 22 for different values of t 
near i=0, from both PS and PS-SC in figure 5 &; 6 by considering the histogram 
of MCMC sample of A 22 . We can easily notice that the spread of the MCMC 
sample fluctuates in between the two modes in a chaotic manner showing poor 
or unstable mixing for PS, whereas PS-SC samples come from both the clusters 
and slowly shift towards the prior mode as t — > 0. We have also studied but don't 
report similar nice behavior regarding mixing of MCMC of PS-SC for the data 
simulated from 1-factor model. 

The poor mixing discussed above for MCMC outputs for PS will now be il- 
lustrated with plots of auto-correlation for A 22 for different lags (Figure 7). For 
the sake of comparison we do the same for PS-SC (Figure 8). Clearly except very 
near t=0, i.e., in what we have called the chaotic zone, the auto-correlations for 
PS are much bigger than those for PS-SC. However, near i=0, though in plots in 
both Figure 7 & 8 are small, those for PS are slightly smaller. We have no simple 
explanation for this. 
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Fig 7. Autocorrelation for X' 2 2 for 
using PS. 
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Fig 8. Autocorrelation for W 2 2 f or different values of t near t=0 (MCMC size used 50,000), 
using PS-SC. 



Poor mixing seems to lead to missing mass and random fluctuations for calcu- 
lations for Et(U). This probably explains the discrepancy we have noticed in the 
estimation of BF by PS as compared with PS-SC. We now look at autocorrela- 
tions for a first factor loading in Figure 7 and second factor loading in Figure 8. 
The top rows in each of the two figures show zero autocorrelation as they are very 
close to t = 0. On the other hand, high autocorrelations are shown in the next 
two rows. We believe they correspond to what we called a chaotic region. The 
bottom two rows of Figure 8 show small autocorrelation. They correspond to the 
second factor loading which comes in only Model 2, and they also depict the zone 
dominated by Model 2. The other figure is in the same zone as in the previous 
line, but the variable considered is a 1-factor loading. Here also autocorrelation 
eventually tends to 0, but its values are bigger than in Figure 8. We don't have 
any simple explanation for this higher autocorrelation. 

The above discussion covers the case when the more complex model is true. If 
the simpler model (Mo) is true, as noted in Subsection 3.3 both PS and PS-SC 
perform well in estimating the Bayes Factor as well as choosing the correct model. 
The Bayes Factor based on PS-SC provides stronger support for the true model 
than the Bayes Factor based on PS. 
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4. IMPLEMENTATION OF OTHER METHODS 

We have explored several methods of estimating the ratio of normalizing con- 
stants, for example the methods of Nielsen (2004), DiCiccio et. al. (1997), Rue et. al. 
(2009) and Chib (1995). The method of Rue et. al. (2009) models a link func- 
tion of means but here we are concerned with models for the variance-covariance 
matrix. We could not use Chib's method here since for our parameter expanded 
prior the full conditionals of the original model parameters are not available. But 
we were able to implement the deterministic variational Bayes method of Nielsen 
(2004) and the Laplace approximation with a correction due to DiCiccio et. al. 
(1997). Since the results were not satisfactory, we do not report them in this 
paper. In the variational Bayes approach, the method selected the correct model 
approximately 80% of the time but the estimated logBF values were consider- 
ably over (or under) estimated. The variational Bayes method is worth further 
study, possibly with suitable modifications. It appears to us it is still not under- 
stood when Belief Propagation provides a good approximation to a marginal or 
not, e.g., Gamarnik et. al. (2010) commented. Only recently we have witnessed 
an explosion of research for theoretical understanding of the performance of the 
BP algorithm in the context of various combinatorial optimization problems, both 
tractable and intractable (NP-hard) versions. 

Following the discussion in the Subsection 2.4, we have implemented the GMP- 
PS. Here the marginal for both models is estimated by constructing a path be- 
tween the prior distribution to the posterior distribution of the model. Due to 
very high-dimensionality of the model, the mode of prior and posterior distribu- 
tion are far apart. So as discussed before, the MCMC sampling along the path 
fails to sample smoothly and fluctuates between the two modes in a chaotic way 
near the prior mode. Hence the estimate of marginal of both the models become 
very unstable. Due to the poor estimation of BF this method also fails to choose 
the correct model very often. As in the case of GMP-PS, AIS with GM path 
did not also work well. Hence, we implemented AIS with PAM-path. Implemen- 
tation of PAM-AIS is also very time intensive, so we have only implemented 
PAMP-AIS with MCMC sample size 5000. PAM-AIS not only shows very high 
MCMC-variance, but it also fails to choose correct model many a time, when the 
2-factor model is correct. The last methods we looked at are, 

1. Importance Sampling (IS). 

2. Newton-Raftery approximation (BICM). 

3. Laplace/BIC type approximation (BICIM). 

IS is the most easy to implement and shows moderately good results in choosing 
the correct model (Ghosh and Dunson (2008)). We study the stability of Bayes 
Factor values estimated by IS with the change of the MCMC size in Table 10. 

Similarly we also study the stability of the estimates of Bayes Factor by BICM 
and BICIM (explained in A. 1.3 in the Appendix) using MCMC sample size 10,000, 
where both of these methods show significantly less amount of MCMC-variance 
than other methods considered. Hence we will only consider PS-SC, BICM and 
BICIM to explore model selection for dimension much higher than previously 
considered. 
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Method(MCMC-size) / True Model 


2-factor Model 


1-factor Model 


IS (10,000) 


109.78 (168.72) 


.0749 (.1063) 


IS (50,000) 


97.12 (61.25) 


-5.39 (84.35) 


IS (100,000) 


86.92 (110.35) 


-3.07 (10.41) 


IS (200,000) 


83.66 (58.53) 


-2.69 (2.96) 


BICM (10,000) 


68.66 (.93) 


-5.72 (.62) 


BICIM (10,000) 


67.9 (.11) 


-5.3 (.57) 


PS-SC (5,000) 


80.75 (.63) 


-8.08 (.0013) 



Table 10 

Study of IS, BICM and BICIM for different MCMC size: Estimated Bayes Factor (MCMC 

Standard Deviation) 

5. EFFECT OF PRECISION PARAMETERS AND HIGH DIMENSIONAL 
(SIMULATED AND REAL) DATASET 

Our goal is to explore if PS-SC may be made more efficient by combining with 
BICM and BICIM and also to explore number of dimension much higher than 
before and real life examples. 

In the examples in this section, p varies from 6 to 26. We have 2 examples 
of real life examples with p=6 and 26 and simulated example with p=20. As 
expected PS-SC still takes long time, even with a parallel processing for high 
dimensional examples. We explore whether PS-SC can be combined with BICM 
and BICIM to substantially reduce time. Since their performance seems much 
faster than PS-SC. 

We compare the behavior of these methods for higher dimensional model and 
for some real datasets taken from Ghosh and Dunson (2009) and Akaike (1987). 
We first consider one 3- factor model with p=20 and n=100. 



Data 


BF 


PS-SC BICM BICIM 


Datal (k=l) 


BF 2 i 
BF 32 
BF 4S 


-25.91 (.0233) -32.68 -38.01 
-24.84 (.0594) -21.18 -38.24 
-22.79 (.0483) -19.81 -43.77 


Data2 (k=2) 


BF21 
BF 32 
BF 43 


225.81 (4.2099) 248.09 219.87 
-23.61 (.0160) -23.59 -46.17 
-19.18 (.0297) -20.3 -47.98 


Data3 (k=3) 


BF21 
BF 32 
BF 43 


152.07 (1.7422) 185.45 162.3 
104.17 (2.5468) 198.1 168.54 
-17.35 (.0276) -29.73 -48.24 



Table 11 

Simulated model (p=20, n=100) and (k—the number of true factors) : Comparison of log 

Bayes Factor 

We notice that all the methods are selecting correct models for all the 3 
datasets, but based on our earlier discussion of PS-SC, we believe only this 
method provides a reliable estimate of BF. Now we will compare the methods 
for some real datasets. We choose two datasets: "Rodent Organ Data" from 
Ghosh and Dunson (2009) and "26-variable Psychological Data" from Akaike 
(1987). These datasets have been normalized first before analyzing them fur- 
ther. We not only study the estimated Bayes Factor but also the model chosen 
by them. 

In the "Rodant Organ Data" the model chosen by PS-SC and other meth- 
ods are correspondingly 3-factor model and 2-factor model. For the "26-variable 
Psychological Data", where PS-SC, and BICM/BICIM chooses the model with 
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Bayes factor 


PS-SC 


BICM 


BICIM 


logBF 2 i 


4.8 


26.34 


21.57 


logBFzi 


10.52 


-3.14 


-10.01 


logBF 4a 


-3.28 







Table 12 

Rodant Organ Weight Data (p=6, n=60) : Comparison of log Bayes factor 



Bayes factor 


PS-SC 


BICM 


BICIM 


logBF 2 i 


122.82 


205.27 


188.19 


logBFs2 


35.27 


71.05 


35.5 


logBF 43 


-10.7 


23.16 


7.55 


logBF 54 


-33.32 


-4.63 


-25.51 


logBF 65 


-16.7 


-17.32 


-43.21 



Table 13 



26-variable Psychological data (p=26, n=300) : Comparison of log Bayes factor 



3 factors and 4 factors respectively. The models chosen by PS-SC and the other 
methods are close, but as expected differ a lot in their estimate of BFs. 

There is still no rigorously proved Laplace approximation for relatively high 
dimensional cases, i.e. when the the number of parameters increases at least as 
fast as sample size. Laplace Approximation have not kept up with the growth of 
novel MCMC and other computational techniques (like PS), see Andrieu et. al. 
(2004), who also give reasons for those, namely, lack of extended analytical input 
and absence of intuitive evaluations of degree of approximations involved. Also, 
Clyde and George (2004), point out that the usual Laplace approximations in- 
volve the sample size but in some problems the "sample size" isn't clear. In BICM 
one avoids the sample size by dealing with the full observed Fisher Information 
matrix, not it's per unit version obtained by dividing by sample size. So the two 
heuristic Laplace approximations we have used, one a simple modification of the 
Kass-Wassreman version (Berger et. al. (2003)) and another more recent, due to 
Raftery et. al. (2007) seem to be good as a preliminary searching method to nar- 
row the field of plausible models before using PS-SC. This saves time relative to 
PS-SC for model search as seen in the previous examples. 

6. CONCLUSION 

We have studied PS for Factor Models (and one other toy example) and have 
identified the component of PS that is most likely to go wrong and where. This 
is partly based on the fact that we have a relatively simple sufficient condition 
for factor models (Theorem 1). Typically for the higher dimensional model the 
MCMC output for finding the integral along grid points in the path may become 
quite unreliable at some parts of the path. Some insight about why it happens 
and how it can be rectified has been suggested. MCMC seems to be unreliable for 
PS when the higher dimensional model is true. The problem is worse the more 
the two models differ as when a very high dimensional model is being compared 
to a low dimensional model. 

The suggestion for rectification was based on the intuition that PS, like Impor- 
tance Sampling itself, seems more reliable when the two marginal densities in the 
Bayes Factor are relatively similar, as is the case when the smaller of two nested 
models is true Based on this intuition we suggested PS-SC and justified PS-SC 
by comparing MCMC output and MCMC variance of both PS-SC and PS. 
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It is our belief that the above insights as to when things will tend to go wrong 
and when not, will also be valid for the other general strategy for selection from 
among nested models namely, RJMCMC. Piyas Chakraborty in Purdue is work- 
ing on a change point problem in several parameters where Shen and Ghosh 
(2011) have an accurate approximation to the Bayes Factor, which may be used 
for validation. He will explore small changes as well as adaptive MCMC. 

Our work has focused on model selection by Bayes Factors, which seems very 
natural since it provides posterior probability for each model. However, model 
selection is a complicated business and one of its major purposes is also to find a 
model that fits the data well. Several model selecting statisticians feel this should 
also be done along with calculation of Bayes Factors. 

However, there has not been a good discussion on how one should put together 
the findings from the two different approaches. We hope to return to these issues 
in a future communication. 



7. ACKNOWLEDGEMENT 

We thank Joyee Ghosh for helping us in discussions on factor models and 
sharing her code and Andrew Lewandowski for thought-provoking comments on 
an earlier draft. 



APPENDIX A: APPENDIX 

A.l Other methods 

A. 1.1 Importance Sampling Suppose we have two densities proportional to 
two functions f(x) and g(x), which are feasible to evaluate at every x, but one 
of the distributions, say the one induced by f(x), is not easy to sample. Then the 
importance sampling (IS) estimate of the ratio of normalizing constants is based 
on m independent draws x±, . . . ,x m generated from the distribution defined by 
g(x). We first compute the importance weights Wi = ^ \ and then define the IS 
estimate: 



Wi. 

m f— : ' 

i=i 

Under the assumption that g(x) ^ when f(x) ^ 0, J2i=i w i converges as 
m — > oo to Zf/Zg, when Zf = J f(x)dx and Z g = J g(x)dx are the normalizing 
constants for f(x) and g(x). The variability of the IS estimate depends heavily 
on the variability of the weight functions. So to have a good IS estimate we need 
to have g(x) as a good approximation to f(x), which is difficult to achieve in 
problems with high or moderately high dimensional, possibly multimodal density. 

Analysis of Bayesian factor models using IS has been introduced by Ghosh and Dunson 
(2008). The IS estimator of BF for factor models is based on m samples 9^ from 
the posterior distribution, under AfW 



' m S v(y\ef\k = h) 
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which in turn is based on the following identity: 
(A.3) 



= p(y\k = h - l) 

p(y\k = h) 

Ghosh and Dunson (2008) implemented IS with a parameter expanded prior. 
They also have noted that IS is fast and often (90%) chooses the correct model 
in simulation. In our simulation IS chooses a true bigger model correctly, but a 
20% error rate was observed when the smaller model is true. 

A. 1.2 Annealed Importance Sampling Following Neal (2001) we consider den- 
sities pt : t € [0, 1] joining the densities po and p\. We choose densities by dis- 
cretising the path p t(i) where = < . . . < t^) = 1 an d then simulate a 
Markov chain designed to converge to Pt {k) ■ Starting from the final states of the 
previous simulation we simulate some number of iterations of a Markov chain 
designed to converge to Pt {k _ 1) - Similarly we simulate some iterations starting 
from the final steps of pt^ designed to converge to Pty-i) un ^ we simulate some 
iterations converging to Vtm ■ This sampling scheme produces a sample of points 



PC 

the ratio of normalizing constant becomes as follows: 



aii, ... ,x m and then we compute the weights Wi = pj^'j • Then the estimate of 



(A-4) -E 



Wi 

m . 



Notice that while both AIS and PS are based on MCMC runs along a path from 
one model to another while the MCMC'S are drawn at each point, but the details 
are very different. Due to the better spread of MCMC samples, the estimates in 
AIS seem to be better than those calculated by IS when the smaller model is 
true, helping in correct model selection and also improving the estimation of 
Bayes Factors. However, simulations show that AIS has the same problem as IS 
in estimating the Bayes Factor when the bigger model is true. 

A. 1.3 BIC type methods : Raftery- Newton and our method using Information 
Matrix In contrast to the methods previously discusssed, we try to directly es- 
timate the marginal under each model and then use these marginals to find the 
Bayes Factor. We know that BIC is an approximation to the log-marginal based 
on a Laplace- type approximation of the log-marginal Ghosh et. al. (2006), under 
the assumption of i.i.d. observations. Thus 

(A.5) 

log(m(x)) « log(f(x\0)ir(8)) + (p/2)log(27r) + (p/2)log(n) + log^H^ 1 / 2 ) 

where H l q is the observed Fisher Information matrix evaluated at the maximum 
likelihood estimator using a single observation. For BIC we just use 

(A.6) log(m(x)) « log(f(x\§M§)) + (p/2)log(n) 

« log(f(x\§))) + (p/2)log(n) 
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ignoring other terms as they are 0(1). 

It is known BIC may be a poor approximation to the log-marginal in high 
dimensions (Berger et. al. (2003)). To take care of this problem, Raftery et. al. 
(2007) suggest the following. Simulate iid MCMC samples from the posterior 
distributions, evaluate independent sequence of log(prior xlikelihood)s (log-p.l.) 
{It ■ t = 1, . . . , m}, and then an estimate for the marginal is 

(A.7) log{m(x)) « I- sf(log(n) - 1) 

where I and sf will be sample mean and variance of It's. We call this method 
BICM, following the convention of Raftery et. al. (2007). 

In order to apply A. 5, we do not need to evaluate n since it cancels by combining 
the last two terms. This suggests the approximation A. 5 take care of the point 
raised by Clyde and George (2004). However A.7 does use n, but we do not know 
the impact on the approximation. 

We have also used the Laplace approximation (A. 5) without any change as 
likely to work better than the usual BIC. We compute the Information Ma- 
trix at the maximum prior x likelihood (mpl) value under the model and impute 
its value in the computation of marginal. To find the mpl estimate we use the 
MCMC sample from the posterior distribution and pick the maxima in that sam- 
ple. Then we search for the mple in its neighbourhood, using it as the starting 
point for the optimization algorithm. In our simulation study, it has been seen 
to give very good results similar to the computationaly intensive numerical algo- 
rithms used to find the maximum of a function over the whole parameter space 
seen by taking repeatation of MCMC runs and large MCMC samples. In the 
spirit of Raftery et. al. (2007), we call this method BICIM, indicating the use 
of Information Matrix based Laplace Approximation. We also used several other 
modifications that did not give good results, so are not reported. 

A. 2 A Theoretical Remark on the likelihood function 

It appears that the behavior of the likelihood, e.g. its maximum plays an im- 
portant role in model selection, specifically in the kind of conflict we see between 
PS and the Laplace approximations (BICM, BICIM) when the bigger model is 
true (and the prior is a t with a relatively small d.f.). The behavior seems to 
be different from the asymptotic behavior of maximum likelihood under the fol- 
lowing standard assumptions. Assume dimension of the parameter space is fixed 
and usual regularity conditions hold. Moreover, when the big model is true but 
the small model is assumed (so that it is a misspecified model) Kullback-Liebler 
projection of the true parameter space to the parameter space of the small model 
exists (Bunke and Milhaud (1998)). 

Fact Assume the big model is true, and the small model is false. Then, as may 
be verified easily by Taylor expansion, 

1 . logL(0 6iff )-logL(e true (bi g ) )=O p (T) 

2. logL(0 sma ^)-logL(KL projection of 9 tr ue(big) to ® sma ii)=Op{l) 

3. logL(6» irue(feis) )-logL(KL projection of 9 true(Mg) to @ sma u)=Op(n) 
and 
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4. 

logL(9 big ) - logL[6 sm all) 

= logL(6 true{big )) - log L(KL projection of 

"true(big) 

tO @ small) +Op(l) 

= P (n) 

The maximized likelihood for Factor models substantially over estimates the 
true likelihood unlike relation (1) above. Unfortunately, as pointed out in Drton 
(2009) the asympyotics of mle for Factor Models is still not fully worked out. 

A. 3 Matrix used for the Toy example 



/ 128.35 


52.69 


-19.25 


-11.86 


24.34 


8.80 


10.63 


13.75 


-7.40 


-29.80\ 


52.69 


73.37 


-21.04 


-37.85 


12.29 


8.74 


15.60 


12.09 


-14.08 


-17.27 


-19.25 


-21.04 


30.86 


8.63 


-1.41 


-13.58 


-3.03 


-11.64 


21.28 


22.05 


-11.86 


-37.85 


8.63 


80.49 


4.66 


3.26 


-49.24 


-9.68 


22.18 


8.52 


24.34 


12.29 


-1.41 


4.66 


15.45 


2.58 


2.05 


3.72 


-1.31 


-7.87 


8.80 


8.74 


-13.58 


3.26 


2.58 


31.37 


11.62 


-4.85 


-16.89 


-20.10 


10.63 


15.60 


-3.03 


-49.24 


2.05 


11.62 


58.09 


7.00 


-19.58 


5.16 


13.75 


12.09 


-11.64 


-9.68 


3.72 


-4.85 


7.00 


26.59 


-3.04 


11.17 


-7.40 


-14.08 


21.28 


22.18 


-1.31 


-16.89 


-19.58 


-3.04 


31.81 


22.86 


V-29.80 


-17.27 


22.05 


8.52 


-7.87 


-20.10 


5.16 


11.17 


22.86 


64.68 / 
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